investigate the promoter region for transcription factor binding sites (TFBS). Transcription
factors (TFs) recognize and bind to specific DNA motifs in the promoter, the TFBS, and
thus regulate transcription. If I know the consensus sequence of the TFBS (template),
i.e. the DNA nucleotides to which the TF binds, I can easily search an unknown sequence
bioinformatically for possible binding sites, which I can then follow up in further
experiments. Appropriate software is already available for this purpose. Apart from
programs that list experimentally validated TFBS (such as MotifMap), there are also
numerous programs that predict TFBS, e.g. ALGGEN PROMO, PRODORIC (Prokaryotic Database
of Gene Regulation), TESS (Transcription Element Search System) or Genomatix. It is
useful to always use several programs, compare their results and look for the TFBS they
have in common. Because such programs often disappear from the openly accessible internet
once they are used and sold commercially, we recently published AIModules, which offers
TFBS analysis including conserved TFBS modules in different promoter regions
(Aydinli et al., 2022; https://aimodules.heinzelab.de/#/).
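The consensus search described above can be sketched in a few lines of Python. This is only a minimal illustration and not how any of the programs named above work internally: the function name `find_consensus_sites`, the consensus `WGGAAA` and the promoter fragment are hypothetical, and only the forward strand is searched.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_consensus_sites(sequence, consensus):
    """Return 0-based start positions where the consensus matches the forward strand."""
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    # The lookahead lets overlapping matches be reported as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]

# Hypothetical consensus and promoter fragment, for illustration only.
hits = find_consensus_sites("ttGGAAAatcgTGGAAAGcc", "WGGAAA")
print(hits)  # [1, 11]
```

A plain consensus search like this treats every position as equally important and allows no mismatches, which is one reason why prediction programs usually work with scoring matrices instead.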
A computer program for promoter analyses would first “learn” the TFBS; this is done
using stochastic models, e.g. position-specific scoring matrices (PSSMs) or hidden Markov
models (HMMs). In a further step, the program would read in a promoter sequence (read-in
part), search it for similarities with the learned consensus sequence (internal calculation
part, e.g. with BLAST) and output the matches as hits (output part).
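The three parts (learning, internal calculation, output) could look roughly like the following sketch. It makes several assumptions that are not from the text: the training sites and the promoter fragment are invented, the background is a uniform 0.25 per base, a pseudocount of 1 is added to each count, and the search is a simple sliding-window scoring with the PSSM rather than BLAST. The sequences must contain only uppercase A, C, G and T.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Learn a log-odds PSSM from aligned binding sites of equal length."""
    width = len(sites[0])
    pssm = []
    for col in range(width):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pssm

def scan(sequence, pssm, threshold):
    """Slide the PSSM along the sequence; return (position, window, score) hits."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

# Hypothetical training sites and promoter fragment, for illustration only.
training_sites = ["TGGAAA", "AGGAAA", "TGGAAT", "TGGAAA"]
pssm = build_pssm(training_sites)            # learning part
promoter = "CCATGGAAAGTCAGGAAACT"            # read-in part
for hit in scan(promoter, pssm, threshold=5.0):  # calculation and output part
    print(hit)
```

With these invented data the scan reports two windows, at positions 3 and 12. Lowering the threshold returns more, weaker hits, and raising it returns fewer.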
Possible challenges and sources of error are, for example, that several DNA sequences
are necessary to create the template: the more binding sites the training data set
contains, the more accurately the template can be trained. Statistical parameters should
also be considered. TFs also often bind to DNA combinatorially at a certain distance from
each other, and other elements, such as enhancers, also influence transcription. A program
should take all these factors and challenges into account to enable accurate prediction.
In any case, it is advisable to validate bioinformatically predicted TFBS experimentally.
Only then can I be sure that the TF actually has an effect on transcription. Otherwise,
the prediction may merely match the DNA nucleotides (which is why I got a hit) without
having any biological relevance, i.e. it is a false positive hit.
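Why false positive hits are so common can be made plausible with a rough estimate of how often a short motif matches purely by chance. The sketch below is a simplification under stated assumptions: bases are independent and equally frequent (0.25 each), only exact matches on one strand are counted, and the function name and example motifs are hypothetical. Real promoter sequences do not satisfy these assumptions exactly.

```python
def expected_random_hits(sequence_length, motif, base_freqs=None):
    """Expected number of exact motif matches on one strand of a random sequence."""
    if base_freqs is None:
        # Assumption: all four bases are equally likely.
        base_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    p_match = 1.0
    for base in motif.upper():
        p_match *= base_freqs[base]
    # Number of windows in which the motif could start.
    windows = max(sequence_length - len(motif) + 1, 0)
    return windows * p_match

# A 6-bp motif in a 1000-bp promoter: 995 / 4096, i.e. about 0.24 chance hits.
print(expected_random_hits(1000, "TGGAAA"))
# A 10-bp motif in the same promoter: 991 / 1048576, i.e. about 0.001 chance hits.
print(expected_random_hits(1000, "TGGAAAATCG"))
```

Short motifs therefore produce chance matches quite readily, especially when hundreds of different TFBS templates are searched at once and mismatches are tolerated.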
Example 3.9
C, D (please also look at the previous answers).
ALGGEN PROMO should find numerous TFBS for the example sequence, including
NF-AT2 [T01945].
If something did not work for you, try the following. In ALGGEN PROMO, select the option
“SearchSites” (under Step 2), copy the sequence into the search window and start the
search. Please make sure that the “Maximum matrix dissimilarity rate” is set to its
default of 15; this specifies the maximum allowed deviation from the actual DNA nucleotide
sequence [template] of the TFBS. You can also change this parameter yourself and observe
what happens. As output you will see all TFBS found, their position and score (under Data
[txt] you can also display a list of the TFBS found and the corresponding TF).
Example 3.10
Hidden Markov models are stochastic probability models that predict hidden system
states (e.g. exon, intron) from a sequence (observations, e.g. ATCCCTG...) using a Markov